Okay, let's start.
You can still hear me, right?
Okay, good.
So we've been talking about natural language as an area of AI, in particular in line with our weak-AI mandate: instead of going after general AI, which covers all the things humans can do, we go after smaller, well-defined problems.
That area is called natural language processing.
We've seen a couple of those things, and we've looked at language models.
The idea there is that instead of having a true/false verdict on whether a string is in the language or not, we put a probability distribution over strings.
That is essentially the approach of an area called corpus linguistics.
The idea there is that rather than some person who knows English or German thinking deeply and declaring "this is how it is in English or German", we just collect a corpus of English or German text and start counting.
In other words, we do it scientifically, as data-driven research.
What this gives us is a variety of probability-based models that are estimated from a corpus, essentially by sophisticated counting.
And we've seen that if we think of a sequence of words as a Markov process, not necessarily a first-order Markov chain but typically something of higher order, then we can just look at the probabilities of small subsequences: unigrams, bigrams, trigrams.
Usually we don't go much higher than trigrams because the models get too big.
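To make that concrete, here is a minimal sketch of how counting subsequences yields a trigram model; the toy corpus and the unsmoothed maximum-likelihood estimate are illustrative choices, not material from the lecture.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus; in practice this would be a large collection of text.
corpus = "the cat sat on the mat the cat ate the rat".split()

bigram_counts = Counter(ngrams(corpus, 2))
trigram_counts = Counter(ngrams(corpus, 3))

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2) from the counts."""
    context = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / context if context else 0.0

print(trigram_prob("the", "cat", "sat"))  # 0.5 in this toy corpus
```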
So one of the things we can do with that is language identification: if we have, say, a character-trigram distribution for each candidate language, we just run them over our text sample and see which one fits best.
Very simple, very effective, very useful.
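As a rough sketch of that idea, assuming we have a training text per language from which to estimate the character-trigram distributions (the tiny texts and the add-one smoothing below are stand-ins, not the lecture's setup):

```python
import math
from collections import Counter

def char_trigrams(text):
    """All overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(text):
    """Character-trigram counts plus their total, as a crude distribution."""
    counts = Counter(char_trigrams(text.lower()))
    return counts, sum(counts.values())

def log_score(sample, model):
    """Log-probability of the sample under the model, with add-one smoothing."""
    counts, total = model
    return sum(math.log((counts[t] + 1) / (total + len(counts) + 1))
               for t in char_trigrams(sample.lower()))

models = {
    "en": train("the quick brown fox jumps over the lazy dog and the cat"),
    "de": train("der schnelle braune fuchs springt ueber den faulen hund"),
}

sample = "the dog and the fox"
best = max(models, key=lambda lang: log_score(sample, models[lang]))
print(best)  # "en" for this sample
```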
Other things that are done in much the same way are genre classification, and named entity recognition, which is an important subtask; we've talked about that in a little bit of detail.
We can also do word n-grams, but of course we have a data problem there: even with the internet, if we want to do word trigrams, we need a lot of data.
So we also need to talk about out-of-vocabulary words: misspellings, new words, special words, dialect words that nobody else knows, those kinds of things.
We essentially condense them down to special tokens in a pre-processing step, and then we can deal with them.
Of course, if we condense lots of different things into single tokens, we are actually destroying information, so there is a tension between simplifying a lot and maybe throwing away too much data.
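A minimal sketch of that pre-processing step, assuming a frequency-based vocabulary and a single <UNK> token (both are illustrative assumptions, not specifics from the lecture):

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep only words seen at least min_count times; everything else is out of vocabulary."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab, unk="<UNK>"):
    """Condense every out-of-vocabulary word into the special unk token."""
    return [w if w in vocab else unk for w in tokens]

tokens = "the cat sat on the mat the cat saw the ratt".split()
vocab = build_vocab(tokens, min_count=2)
print(replace_oov(tokens, vocab))
# ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat', '<UNK>', 'the', '<UNK>']
```

Note the information loss: "sat", "saw", and the misspelled "ratt" all collapse into the same token.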
Out-of-vocabulary items are not something we have a lot of problems with in most languages if we use character models, at least for written language.
With my handwriting, I'm sure there would be lots of out-of-vocabulary characters that you just aren't sure what they are, but typically that's not what we're looking at.
For words, though, it is already a problem.
And the last thing we looked at was an evaluation measure.
In NLP it's very important to measure things: how well are we doing?
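A standard such measure for language models is perplexity, the exponential of the average negative log-probability per word on held-out text; lower is better. Here is a minimal sketch for a bigram model with add-one smoothing (the specific model, smoothing, and data are illustrative assumptions, not necessarily what the lecture used):

```python
import math
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

train = "the cat sat on the mat the cat ate the rat".split()
test = "the cat sat on the rat".split()

unigram_counts = Counter(train)
bigram_counts = Counter(bigrams(train))
vocab_size = len(unigram_counts)

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one (Laplace) smoothing so unseen bigrams get nonzero probability."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

# Perplexity = exp(-average log-probability per predicted word).
log_prob = sum(math.log(bigram_prob(w1, w2)) for w1, w2 in bigrams(test))
perplexity = math.exp(-log_prob / len(bigrams(test)))
print(round(perplexity, 2))
```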